Machine Translation Evaluation: N-grams to the Rescue
Abstract
Human judges weigh many subtle aspects of translation quality, but human evaluations are very expensive, and developers of machine translation systems need to evaluate quality constantly. Automatic methods that approximate human judgment are therefore very useful. The main difficulty in automatic evaluation is that many correct translations exist, differing in choice and order of words, so there is no single gold standard to compare a translation against. The guiding idea is that the closer a machine translation is to professional human translations, the better it is. To measure this closeness, we borrow the notions of precision and recall from Information Retrieval and apply a precision measure to variable-length n-grams: unigram matches between the machine translation and the professional reference translations account for adequacy, while longer n-gram matches account for fluency. The n-gram precisions are aggregated across sentences and averaged, and a multiplicative brevity penalty prevents a system from gaming the metric with overly short output. The resulting metric correlates highly with human judgments of translation quality, and we test it for robustness across language families and across the spectrum of translation quality. We discuss BLEU, an automatic method for evaluating translation quality that is cheap, fast, and good.
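To make the mechanics concrete, the following is a minimal Python sketch of this style of scoring for a single candidate sentence against multiple references. It is an illustration under simplifying assumptions (whitespace tokenization, sentence-level rather than corpus-level aggregation), not the paper's reference implementation; the function names ngrams, modified_precision, and bleu are invented for the example.

import math
from collections import Counter

def ngrams(tokens, n):
    # Count all n-grams of length n in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    # Clipped n-gram precision: each candidate n-gram is credited at most
    # as many times as it occurs in the reference where it is most frequent.
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, references, max_n=4):
    # Geometric mean of the 1..max_n modified precisions, multiplied by a
    # brevity penalty that punishes candidates shorter than the reference.
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # the geometric mean vanishes if any precision is zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Effective reference length: the reference length closest to c.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(log_mean)

# Clipping at work: this pathological candidate gets unigram precision 2/7,
# not 7/7, because "the" occurs only twice in the reference.
print(modified_precision("the the the the the the the".split(),
                         ["the cat is on the mat".split()], 1))

candidate = "the cat is on the mat".split()
references = ["there is a cat on the mat".split(),
              "the cat sits on the mat".split()]
print(round(bleu(candidate, references, max_n=2), 3))  # 0.775

Note that the paper aggregates the clipped counts over a whole test corpus before taking the geometric mean; the sentence-level version here (and the reduced max_n in the demo, since 4-gram matches are sparse in a single sentence) is only for readability.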
Similar Papers
Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics
Evaluation is recognized as an extremely helpful forcing function in Human Language Technology R&D. Unfortunately, evaluation has not been a very powerful tool in machine translation (MT) research because it requires human judgments and is thus expensive, time-consuming, and not easily factored into the MT research agenda. However, at the July 2001 TIDES PI meeting in Philadelphia, IBM descri...
Tackling Sparse Data Issue in Machine Translation Evaluation
We illustrate and explain problems of n-gram-based machine translation (MT) metrics (e.g., BLEU) when applied to morphologically rich languages such as Czech. A novel metric, SemPOS, based on the deep-syntactic representation of the sentence, tackles the issue and retains performance for translation into English as well.
Truly Exploring Multiple References for Machine Translation Evaluation
Multiple references in machine translation evaluation are usually under-explored: they are ignored by alignment-based metrics and treated as bags of n-grams in string-matching evaluation metrics, none of which take full advantage of the recurring information in these references. By exploring information on the n-gram distribution and on divergences in multiple references, we propose a method of...
Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics
In this paper we describe two new objective automatic evaluation methods for machine translation. The first method is based on the longest common subsequence between a candidate translation and a set of reference translations. Longest common subsequence naturally takes sentence-level structural similarity into account and identifies the longest co-occurring in-sequence n-grams automatically (a minimal sketch of this idea follows the list). The second m...
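The longest-common-subsequence idea in the last entry above can likewise be sketched in a few lines of Python. This is an illustration in the spirit of such LCS-based scoring, not the authors' implementation; the names lcs_length and lcs_fscore are invented for the example.

def lcs_length(a, b):
    # Standard dynamic program for the longest common subsequence of two
    # token lists; table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = (table[i - 1][j - 1] + 1 if x == y
                           else max(table[i - 1][j], table[i][j - 1]))
    return table[len(a)][len(b)]

def lcs_fscore(candidate, reference, beta=1.0):
    # F-measure over LCS-based precision (LCS / candidate length) and
    # recall (LCS / reference length); beta trades the two off.
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(candidate), lcs / len(reference)
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)

print(round(lcs_fscore("the cat is on the mat".split(),
                       "the cat sat on the mat".split()), 3))  # 0.833

Unlike contiguous n-gram matching, the LCS rewards words that appear in the same order even when other words intervene, which is the sentence-level structural similarity the entry refers to.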